A conversation with the Utah Office of AI Policy research team.
I know it when I see it. I can evaluate any individual case. I can't yet write the rules that would let someone else do what my team does. — Zach Boyd, CHAI 2026
The sandbox is at the stage where every approval depends on staff judgment. The next phase is turning case review into standards — and that transition needs a shared evidence base, not just more reviewers.
‘Human in the loop’ is an allocation problem. When can a clinician responsibly delegate ninety to one hundred percent of a decision? — Zach Boyd, CHAI 2026
The honest answer requires evidence of what the system actually did — not a vendor self-report, not a policy document, not a monitoring dashboard the vendor also owns.
Once standards are set, they ossify. The underlying technology keeps moving. The standard stops moving with it. — Zach Boyd, CHAI 2026
Standards at the policy layer. Evidence at the runtime layer. One can evolve without re-certifying the other — which is how a sandbox stays current as the systems it governs change underneath.
Was the declared control actually running at the moment of the decision?
Did the system's behavior drift from its baseline over time — and if so, when?
Can an independent auditor verify the first two offline, years later, without trusting us or the vendor?
Three components. Each one answers one of the questions on the previous slide. Each one is independently verifiable.
Every inference call is wrapped. A cryptographic receipt is emitted with the inputs, the declared policy, and the output — signed at the moment of execution.
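A minimal sketch of how such a receipt could be emitted, assuming an HMAC secret for the signature (a production system would more likely use an asymmetric scheme such as Ed25519); every field name and the key here are illustrative assumptions, not our actual schema.

```python
import hashlib
import hmac
import json
import time

# Hypothetical signing key for illustration only; a real deployment
# would sign with an asymmetric key pair, not a shared secret.
SIGNING_KEY = b"demo-signing-key"

def emit_receipt(inputs: dict, policy_id: str, output: dict, prev_hash: str) -> dict:
    """Wrap one inference call in a signed, chain-linked receipt."""
    body = {
        "timestamp": time.time(),
        "policy_id": policy_id,  # the declared control in force at execution time
        "inputs_hash": hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
        "output_hash": hashlib.sha256(
            json.dumps(output, sort_keys=True).encode()).hexdigest(),
        "prev_hash": prev_hash,  # links receipts into a tamper-evident chain
    }
    # Sign the canonical JSON form so any later mutation is detectable.
    canonical = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hmac.new(SIGNING_KEY, canonical, hashlib.sha256).hexdigest()
    return body

receipt = emit_receipt({"prompt": "triage note"}, "policy-v1",
                       {"label": "route-to-clinician"}, prev_hash="0" * 64)
```

Hashing inputs and outputs, rather than storing them, is what keeps PHI out of the evidence stream while still binding the receipt to exactly what the system saw and did.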
A behavioral baseline is established and continuously monitored. Distributional shift is caught as it happens — not weeks later in a retrospective review.
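One common way to flag distributional shift is a population stability index over bucketed model outputs; the sketch below is illustrative, and the bucket edges and the 0.2 threshold are conventional rules of thumb, not our production values.

```python
import math

def psi(baseline: list[float], live: list[float], edges: list[float]) -> float:
    """Population stability index between a baseline sample and live traffic."""
    def proportions(sample: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for x in sample:
            counts[sum(1 for e in edges if x >= e)] += 1  # bucket index
        # Small floor avoids log(0) for empty buckets.
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Illustrative convention: PSI above 0.2 is commonly read as significant shift.
baseline_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
live_scores = [0.7, 0.8, 0.9, 0.95, 0.85, 0.75]
drifted = psi(baseline_scores, live_scores, edges=[0.25, 0.5, 0.75]) > 0.2
```

Because the check runs on each batch of live traffic as it arrives, the flag fires at the moment the distribution moves, with the offending window of evidence attached.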
A Python verifier reads the receipt stream and confirms the chain. Your auditor verifies our evidence without trusting us, our servers, or our certificates.
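In spirit, the offline check is simple: walk the receipt stream, recompute each signature, and confirm each receipt points at its predecessor. The sketch below assumes receipts carry a `prev_hash` field and an HMAC `signature` over their canonical JSON body; the field names, helper, and key are hypothetical.

```python
import hashlib
import hmac
import json

def verify_chain(receipts: list[dict], key: bytes) -> bool:
    """Confirm signatures and prev_hash linkage across a receipt stream."""
    prev = "0" * 64  # genesis marker for the first receipt
    for r in receipts:
        body = {k: v for k, v in r.items() if k != "signature"}
        canonical = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(key, canonical, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, r["signature"]):
            return False  # tampered or forged receipt
        if r["prev_hash"] != prev:
            return False  # broken linkage: a receipt was removed or reordered
        prev = hashlib.sha256(canonical).hexdigest()
    return True

def make_receipt(payload: dict, prev: str, key: bytes) -> dict:
    """Hypothetical helper: sign a payload and link it to the previous receipt."""
    body = dict(payload, prev_hash=prev)
    canonical = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    return body

key = b"demo-key"
r1 = make_receipt({"step": 1}, "0" * 64, key)
c1 = json.dumps({k: v for k, v in r1.items() if k != "signature"}, sort_keys=True).encode()
r2 = make_receipt({"step": 2}, hashlib.sha256(c1).hexdigest(), key)
ok = verify_chain([r1, r2], key)
```

The point of the design is that this verifier needs no network, no live server, and no trust in the operator: the receipts alone either check out or they don't.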
Every sandbox participant generates structured, comparable evidence against the same schema.
Your office moves from staff-judgment approvals toward generalizable rules, at the pace your team decides — not at the pace of the next hiring cycle.
Participants carry their attestation record with them when they deploy downstream.
Not a credentials slide. Where our evidence comes from, in one breath.
Instrument one existing sandbox participant with our evidence layer.
Zero cost. Zero PHI egress. Zero procurement process.
Monthly evidence report your team can independently verify — and in return, we learn what format is most useful to your review process.
One sentence per stage. The demo itself follows.
The participant states the policy the system is supposed to follow, in machine-readable form.
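As a purely hypothetical illustration of what "machine-readable" could mean here — every field name and value below is an assumption, not a fixed schema:

```python
# Hypothetical policy declaration; field names and values are illustrative.
declared_policy = {
    "policy_id": "sandbox-participant-7/triage-v1",
    "model": "triage-classifier",
    "allowed_actions": ["suggest", "flag-for-review"],  # no autonomous decisions
    "human_review_required_above_risk": 0.3,            # delegation threshold
    "prohibited_inputs": ["ssn", "free-text-phi"],
}

def action_permitted(action: str, risk: float, policy: dict) -> bool:
    """Check a proposed action against the declared control."""
    if action not in policy["allowed_actions"]:
        return False
    # High-risk cases may only be escalated, never acted on directly.
    return risk <= policy["human_review_required_above_risk"] or action == "flag-for-review"

ok = action_permitted("suggest", 0.1, declared_policy)
```

Stating the control in this form is what makes the later runtime question answerable: the receipt can record not just what happened, but which declared rule was in force when it happened.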
We establish the behavioral baseline from real, consented traffic before anything goes live.
Every inference call in production is wrapped and the declared control is enforced at runtime.
Drift from the baseline is flagged as it happens, with the evidence attached.
Your auditor re-runs the verifier offline and confirms the record, months or years later.